How to introspect the Freebase schema with SPARQL

... as well as other RDF databases

Paul Houle

Creator of database animals and bayesian brains

July 23, 2014

Prelude

This post is a followup to How to write SPARQL queries against Freebase data and is part of a series. Subscribe to the RSS feed and to the :BaseKB mailing list for future episodes.

The data set I'm using is the 2014-03-02 edition of :BaseKB Gold. You can download it via BitTorrent and load it into any standards-compliant triple store, but it's even faster to use the pre-loaded Compact Edition, which deploys perfectly matched hardware, software and data in just one click.

The need for schema inspection

It's hard to imagine a data model simpler than RDF, which is based on two concepts, the Node and the Triple. What is hard is getting a handle on a database that has 800 million facts and 60,000 properties. If you don't know which properties to use, the data you want can be hidden in plain sight. Thus, one of the big problems in using this kind of database is understanding its contents.

Opening the T-Box

Because RDF schemas are expressed in RDF, you can use ordinary SPARQL queries to ask questions about schemas. In our last episode, we used the :geography.river.length property, so let's take a look at its schema information with this SPARQL query, which displays all triples that have :geography.river.length as a subject.

prefix : <http://rdf.basekb.com/ns/>

select ?p ?o {
   :geography.river.length ?p ?o .
}

This gives the following result:

[Screenshot: River Result]

Note that here we get a mix of RDFS and OWL vocabulary together with some Freebase-specific vocabulary. Instead of the standard xsd:float, the data type of the object (rdfs:range) is recorded as :type.float. The Freebase property :type.property.expected_type corresponds to rdfs:range, and :type.property.schema corresponds to rdfs:domain.
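To put the standard and the Freebase-specific vocabulary side by side, a query along these lines could fetch both pairs at once. This is a minimal sketch: it assumes an rdfs:domain fact is asserted alongside rdfs:range, and each pattern is wrapped in optional so that a missing fact leaves a blank column instead of dropping the row.

```sparql
prefix : <http://rdf.basekb.com/ns/>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>

# Compare the RDFS view of the property with the Freebase view.
# Assumption: all four facts hang off the named property as subject.
select ?range ?expectedType ?domain ?schema {
  optional { :geography.river.length rdfs:range ?range . }
  optional { :geography.river.length :type.property.expected_type ?expectedType . }
  optional { :geography.river.length rdfs:domain ?domain . }
  optional { :geography.river.length :type.property.schema ?schema . }
}
```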

Note that the triple store I'm running has RDFS inference turned off, so we are seeing facts asserted in Freebase, not facts that could be inferred via RDFS or OWL.

Knowing that Freebase properties typically have the structure :domain.type.property, you'd imagine that Freebase encodes schema data under the :type domain, and you'd be right. Look at

https://www.freebase.com/type?schema=

and you'll find a number of types which represent metainformation about Freebase. As you'll see later, you can also find schema information in other Freebase domains.

Let's take a look at the schema for :type.type, the Freebase version of rdfs:Class. At the following page we see

https://www.freebase.com/type/type?schema=

[Screenshot: Type schema]

Facts with the :type.type.instance predicate have been deleted from :BaseKB because they are bulky (there is one for each a, that is rdf:type, fact in the system) and also because you end up with some subjects having millions of :type.type.instance facts, which blows out memory if you try to bring all facts with a given subject together in one place.
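With the :type.type.instance facts gone, instances of a type have to be found from the other direction, through rdf:type (written a in SPARQL). Here is a sketch of counting rivers this way. It assumes rivers are typed with :geography.river in the same way the authors later in this post are typed with :book.author.

```sparql
prefix : <http://rdf.basekb.com/ns/>

# Count instances via rdf:type instead of :type.type.instance.
select (count(?river) as ?cnt) {
  ?river a :geography.river .
}
```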

Double Vision

If we take a look at facts concerning :geography.river

prefix : <http://rdf.basekb.com/ns/>

select ?p ?o (lang(?o) AS ?lang) {
  :geography.river ?p ?o .
}

we see we have labels in many languages

[Screenshot: River Facts]

but many of the facts that we'd expect to have in the schema (such as those that apply to :type.type) are missing. It turns out that many of these are registered under the mid identifier for :geography.river, which we can find by doing a key lookup

prefix : <http://rdf.basekb.com/ns/>

select ?riverMid {
  ?riverMid :type.object.key "/geography/river"
}

We get

[Screenshot: Mid result]
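The lookup also runs in the other direction. Taking :m.01xs05k, the mid that the rest of this post uses for :geography.river, a sketch like the following would list every key registered for it. Whether more than one key comes back depends on the data.

```sparql
prefix : <http://rdf.basekb.com/ns/>

# List the keys attached to a known mid.
select ?key {
  :m.01xs05k :type.object.key ?key .
}
```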

Next I'll look up the facts with :m.01xs05k on the left-hand side, excluding the labels (which are the same as the labels above) so the results fit in a screenshot:

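prefix : <http://rdf.basekb.com/ns/>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>

select ?p ?o {
  :m.01xs05k ?p ?o .
  # skip the labels, which duplicate the ones shown above
  FILTER(?p != rdfs:label)
}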

[Screenshot: River Properties]

Looking at these facts we find all kinds of cool stuff, including summary data.

It's now straightforward to write SPARQL to answer questions about the schema. For instance, we can get a list of properties:

prefix : <http://rdf.basekb.com/ns/>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>

select ?propId ?propLabel {
  :m.01xs05k :type.type.properties ?prop .
  ?prop :type.object.id ?propId .
  ?prop rdfs:label ?propLabel .
  # keep only the English labels
  FILTER(LANG(?propLabel)='en')
}

with the following results:

[Screenshot: Property list]
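The same pattern scales from one type to the whole schema. As a rough sketch, dropping the fixed subject and grouping by type would rank types by how many properties they declare. It assumes every type of interest carries a :type.object.id fact, and the limit of 20 is an arbitrary choice for illustration.

```sparql
prefix : <http://rdf.basekb.com/ns/>

# Rank types by the number of distinct properties they declare.
select ?typeId (count(distinct ?prop) as ?propCount) {
  ?type :type.type.properties ?prop .
  ?type :type.object.id ?typeId .
}
group by ?typeId
order by desc(?propCount)
limit 20
```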

Just to make clear what is going on, you'll never actually see :m.01xs05k in the predicate field

prefix : <http://rdf.basekb.com/ns/>

select (count(*) as ?cnt) {
  ?s :m.01xs05k ?o . 
}

[Screenshot: Zero results]

The weirdness here is because of the nature of Freebase.

The "real" identifiers in Freebase are the mid identifiers, which are more-or-less sequential integers. graphd, the internal Freebase database, resolves names like '/geography/river' to mids when processing queries, and then converts names back to 'human friendly' form for display.

A major difficulty with processing the old Freebase quad dump was that it did not use consistent identifiers in the various fields, which meant that it was not possible to do any processing that joined the schema with the data. The original :BaseKB fixed this problem by resolving all identifiers to mids, but this meant that queries looked like this:

prefix : <http://rdf.basekb.com/>

select ?river ?length {
   ?river :m.01xs0f4 ?length .
   ?river :m.014h :m.06bnz .
} ORDER BY DESC(?length) LIMIT 1

Writing queries like this is a bit like coding in assembly language. The superficial difficulties can be fixed by rewriting queries to imitate graphd's name resolution behavior, but when Freebase switched to an official RDF dump, they committed to using consistent identifiers for predicates, and :BaseKB followed.
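If you run into a mid-only query like the one above, the mids can be translated back into names with the same :type.object.id lookup used earlier. This is a sketch, assuming these three mids still exist under the current http://rdf.basekb.com/ns/ namespace and carry :type.object.id facts:

```sparql
prefix : <http://rdf.basekb.com/ns/>

# Translate the mids from the old-style query into human-friendly ids.
select ?mid ?id {
  values ?mid { :m.01xs0f4 :m.014h :m.06bnz }
  ?mid :type.object.id ?id .
}
```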

Included Types

Another funny thing about Freebase is the concept of an 'Included Type', which is similar, but not quite identical, to the RDFS concept of a subclass (rdfs:subClassOf). If we turtleize the relevant properties of the :geography.river mid, we get

:m.01xs05k
    :freebase.type_profile.strict_included_types
        :m.01n7 , :m.02h5yxm ;
    :freebase.type_hints.included_types
       :m.01m7,  :m.01y2jks, :m.01c5, :m.02h5yxm .

The included_types property has been around since Jan 2007, strict_included_types is newer, created in Feb 2013.

The original included_types came out of the requirements for a community-edited database. For instance, ':people.person' is an included type of ':book.author' because the author of a book is usually a person. This means that when somebody adds an author to a book, Freebase automatically assumes that this is a person. Although it's not factually true that authors are always people, it's true enough that we get better results assuming this rather than expecting users to tag authors as persons manually.

(If we believe Freebase, there are 4360 authors who are not people, out of 533,452. The query below counts the un-people.)

prefix : <http://rdf.basekb.com/ns/>

select (count(*) as ?cnt) {
   ?author a :book.author .
   minus {
      ?author a :people.person .
   }   
}
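To get the total and the overlap from a single query, a variant with optional could be used. This is a sketch rather than the query behind the figures quoted above. The number of un-people is ?authors minus ?people, and ?person is a helper variable introduced here for illustration.

```sparql
prefix : <http://rdf.basekb.com/ns/>

# ?person is bound only for authors that are also typed as people,
# and count() skips unbound values.
select (count(?author) as ?authors) (count(?person) as ?people) {
  ?author a :book.author .
  optional {
    ?author a :people.person .
    bind(?author as ?person)
  }
}
```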

In the case of river, we can look up the included types like so,

prefix : <http://rdf.basekb.com/ns/>

select ?that ?id {
   ?that :type.object.id ?id .
   :m.01xs05k :freebase.type_hints.included_types ?that .
}
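The two properties can also be compared directly. The sketch below lists each included type from :freebase.type_hints.included_types and flags whether it also appears under :freebase.type_profile.strict_included_types. The variable ?strict is a name introduced here for illustration.

```sparql
prefix : <http://rdf.basekb.com/ns/>

select ?id ?strict {
  :m.01xs05k :freebase.type_hints.included_types ?that .
  ?that :type.object.id ?id .
  # true when the same type is also listed as a strict included type
  bind(exists { :m.01xs05k :freebase.type_profile.strict_included_types ?that } as ?strict)
}
```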

Next steps

Since schema information can be expressed in RDF, RDF schemas can be explored using SPARQL.

Although Freebase uses some standard vocabulary, most schema information is expressed with non-standard vocabulary. This makes sense, since the Freebase schema supports a collaborative editing interface rather than RDFS inference.

It takes just a little knowledge, outlined in this article and documented in Freebase, to ask questions about the Freebase schema in SPARQL. This knowledge can be the basis for RDF-based Freebase browsing interfaces (to be discussed in a future post), conversion to RDFS/OWL schemas that can be used with tools like Protégé, as well as hand-written SPARQL queries.

This post is part of a series: future posts will cover compound value types, how to look up identifiers, and other topics. Subscribe to our RSS feed and the :BaseKB mailing list.